Skip to content

Conversation

d10c
Copy link
Contributor

@d10c d10c commented Aug 11, 2025

This PR adds overlay support to the Python extractor, but no overlay compilation (to be merged separately since it needs further testing, see this PR).

This PR also includes an initial pass at the discard predicates (see Overlay.qll), though these are ignored in full (non-overlay) evaluation; they probably still need to be tweaked, so I'm happy to move this commit to another PR and let this one be only about the extractor.

Roadmap:

  • Update the dbscheme
  • Implement path transformer support
  • Read the overlay-changes JSON file
  • Read/write base metadata (CODEQL_EXTRACTOR_<LANG>_OVERLAY_BASE_METADATA_{IN,OUT})

@d10c d10c force-pushed the d10c/python-overlay branch 2 times, most recently from b18b9ce to 3015c12 Compare August 12, 2025 10:48
@d10c d10c force-pushed the d10c/python-overlay branch from b0c7a52 to b5c8338 Compare August 19, 2025 18:20
@d10c d10c force-pushed the d10c/python-overlay branch from f75a392 to 63106c0 Compare August 20, 2025 14:32
@d10c d10c force-pushed the d10c/python-overlay branch from 63106c0 to b3a1ba5 Compare August 27, 2025 08:42
@d10c d10c force-pushed the d10c/python-overlay branch from b3a1ba5 to fb23977 Compare August 27, 2025 08:59
@d10c
Copy link
Contributor Author

d10c commented Aug 28, 2025

@tausbn I'm thinking this might be a good time to checkpoint this work and get it reviewed. In the last DCA run for full analysis on this PR (see above), overall analysis time is unaffected, though there are a few outstanding stage timing results that are probably noise.

@d10c d10c marked this pull request as ready for review August 28, 2025 10:33
@d10c d10c requested a review from a team as a code owner August 28, 2025 10:33
@d10c d10c requested review from Copilot and tausbn August 28, 2025 10:33
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds overlay support to the Python extractor by implementing infrastructure for incremental analysis through database overlays, without including overlay compilation functionality.

Key changes implemented:

  • Database schema updates to support overlay metadata and change tracking
  • Extractor modifications to handle overlay-specific file traversal and metadata management
  • Path transformer support using updated environment variables

Reviewed Changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
python/ql/lib/semmlecode.python.dbscheme Adds databaseMetadata and overlayChangedFiles relations for overlay support
python/ql/lib/semmle/python/Overlay.qll Implements discard predicates to filter out obsolete entities during overlay analysis
python/extractor/semmle/traverser.py Modifies file traversal to only process changed files during overlay extraction
python/extractor/semmle/worker.py Adds support for writing base metadata output required for overlay operations
python/extractor/semmle/path_rename.py Updates path transformer to support new CODEQL_PATH_TRANSFORMER environment variable

with open(os.environ['CODEQL_EXTRACTOR_PYTHON_OVERLAY_CHANGES'], 'r', encoding='utf-8') as f:
data = json.load(f)
changed_paths = data.get('changes', [])
self.overlay_changes = { os.path.abspath(p) for p in changed_paths }
Copy link
Preview

Copilot AI Aug 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The variable name self.overlay_changes is inconsistent with the other instance variables which use snake_case (self.exclude_paths, self.recurse_files, etc.). Consider renaming to self.overlay_changed_paths for consistency.

Copilot generated this review using guidance from repository custom instructions.

Copy link
Contributor

@tausbn tausbn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I think this looks good. 👍

Do we have any tests for this? I feel like we might want to have a few CLI Integration tests to check that the overlay JSON files are being applied correctly. (The integration tests live here: https://github.com/github/codeql/tree/main/python/extractor/cli-integration-test)

Also, don't forget to update the extractor version here: https://github.com/github/codeql/blob/main/python/extractor/semmle/util.py#L13
(In this case, I think bumping it to 7.1.4 would be fine. We don't really have fixed rules for how to increase the version. The most important thing is that it changes so that we can tell from the log output what version of the extractor we're running.)

Comment on lines 36 to 38
if 'CODEQL_EXTRACTOR_PYTHON_OVERLAY_CHANGES' in os.environ:
with open(os.environ['CODEQL_EXTRACTOR_PYTHON_OVERLAY_CHANGES'], 'r', encoding='utf-8') as f:
data = json.load(f)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm debating whether we should have some exception handling here (substituting the empty list of changed files in case something goes wrong). Currently, if something ends up being messed up in the JSON, then I believe the whole extraction will just fail.

I don't have strong feelings about it, though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review! I also don't have strong opinions about whether file reading should fail loudly or warn and continue with a default (None, i.e. full extraction). I guess I'll go for the latter. And also insert a logger statement with the value of the environment variable, as is the convention elsewhere in the extractor.

@d10c
Copy link
Contributor Author

d10c commented Sep 5, 2025

Do we have any tests for this? I feel like we might want to have a few CLI Integration tests to check that the overlay JSON files are being applied correctly. (The integration tests live here: https://github.com/github/codeql/tree/main/python/extractor/cli-integration-test)

There are basic integration tests here but they depend on overlay compilation (not part of this commit), and also I'm still running into some issues on Windows (it appears that the path transformer is not working correctly there—currently debugging that). So maybe merging this should wait until I have that sorted.

Otherwise, do you have an idea for an integration test for this functionality that doesn't also exercise complete overlay evaluation?

And don't add slash to start of path patterns on Windows.
…t-in files

On Windows, we're getting e.g. the following mismatches, which could be due to case differences:

"Skipped built-in file C:\hostedtoolcache\windows\Python\3.13.7\x64\Lib\multiprocessing\forkserver.py"
vs "Extracted file C:\hostedtoolcache\windows\Python\3.13.7\x64\lib\asyncio\streams.py"
@d10c d10c force-pushed the d10c/python-overlay branch from fb23977 to f309dc6 Compare September 10, 2025 18:42
@d10c
Copy link
Contributor Author

d10c commented Sep 10, 2025

I think I've figured out why path transformers weren't working on Windows and why built-in modules were being extracted (see latest commits). Now the integration test on the other PR passes.

The only remaining thing now is solving some tuple count regressions uncovered through DCA, but that can be done independently of this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants